\ SUNDAY QUICKSEARCH ALGORITHM
\ By Leonard Morgenstern nleonard@aol.com

\  Forth Scientific Library Algorithm #41

\ Adapted from Daniel Sunday. Communications of the ACM 33:132 1990.

\ QUICKSEARCH ( c-addrp up c-addrt  ut -- c-addr)
\ Locate the first instance of a search-pattern with coordinates c-addrp up within a
\ text-string c-addrt ut. Return the address of the first match in the text if found,
\ or NULL if not, where NULL is a user-defined impossible address. An ambiguous
\  condition exists if the search-pattern (up) is longer than the text-string (ut). 

\ The search algorithm

\ Before beginning the search, the algorithm constructs an array called a
\  shift-table, having the same number of cells as the number of possible characters,
\  normally 256. The contained quantity is up +1 (where up is the length of the
\  search-pattern) for characters not present in the search-pattern, and the number
\  of characters from the end of the string for characters that are present, the
\  final character in the string being counted as 1. If duplicates are present, the
\  lowest number is used. For example, if the pattern string is "THAT", then T=1,
\  A=2, H=3, and all others are 5. If two characters are equivalent, they are
\  assigned the same shift-values; in the example, T and t would both have the value
\  1 if case is to be ignored. The shift-table is the only data structure needed by
\  the algorithm, except, of course, the text and pattern strings themselves.

\  At the start, the first character of the pattern is positioned on the first
\  character of the text and up characters are compared. (The choice of a comparison
\  algorithm will be discussed below.) If there is a mismatch, the text character
\  just beyond the pattern is examined, and the corresponding shift obtained from the
\  shift-table. The pattern is then advanced that number of characters. The process
\  is repeated until the pattern is matched or the text-string is exhausted.

\ An important characteristic of Sunday's algorithm is that the distance advanced
\  depends only on the text character just beyond the search-pattern; it does not
\  depend on the character that failed to match, its position, or the order in which
\  the pattern and text are compared. The algorithm may search front-to-back,
\  back-to-front, or a scrambled order based on letter-frequencies. Naturally, the
\  choice affects performance, as will be discussed in the next section.

\ The comparison algorithm

\ The algorithm for comparing the search-pattern and text (the "comparator") is a
\  deferred word with the stack diagram:

\ COMPARATOR ( c-addrp up c-addrt ut -- n)
\ where n=0 for a match and non-zero for a mismatch. The stack diagram is the same
\  as that of the word COMPARE in the ANSI Standard String Word Set.

\ COMPARATOR should be carefully selected to optimize performance. The standard word
\  COMPARE is a good choice in most cases. Situations where another comparator should
\  be used include: 1) when certain sets of characters are regarded as equivalent,
\  for example, when case is to be ignored; 2) where wild-cards are allowed; and 3)
\  where character frequencies dictate a more efficient search order. In general, the
\  search is fastest when rare characters are matched first.

\ Efficiency

\ The efficiency of the search can be measured by the number of comparisons. The most
\  favorable case is seen when the first character in the pattern is never present in
\  the text and the text character just beyond the pattern never happens to be
\  present in the pattern; the number of comparisons is then ut/(up+1). At the other
\  extreme, there are degenerate cases in which the number of comparisons approaches
\  ut*up. This would occur when using a front-to-back comparator if the text and
\  search patterns contain long runs of the final character in the search pattern,
\  for example, searching for AAAABA in AAA...AAAAABA. This is not merely a
\  theoretical problem, as it might occur in searching for a string embedded in
\  spaces, or in searching graphics files. In the example, performance would be
\  improved by matching the odd character first, but the number of comparisons could
\  not be reduced much below nt. If there is a significant chance that patterns like
\  these might occur, Sunday's algorithm probably should not be used.

\ Requirements
\  This is an ANS Forth program requiring:
\   1. A standard system with Core and Core extension sets.
\   2. String extension set if COMPARE is used to compare strings.
\   3. Uses the words V: and DEFINES from the FSL utilities to
\      manage vectored words
\   4. The compilation of the test code is controlled by the
\      VALUE TEST-CODE? and the conditional compilation words in
\      the Programming-Tools wordset
\   5. The test code uses the String wordset word /STRING
\ ( 6. The test code uses the (non ANS) words:
\        TAB to emit a tab character and,
\        DUP>R which is equivalent to:
\        : DUP>R   DUP >R ;  )

( TAB and DUP>R removed  29Jan2005  cgm )

\   This code is released to the public domain. Leonard Morgenstern, March 1995

include lib/compare.4th
include lib/constant.4th

\ SUNDAY QUICKSEARCH ALGORITHM
\ ANSI STANDARD
\ REQUIRES CORE, CORE EXTENSION, AND STRING EXTENSION

\ SETTING PARAMETERS

256 CONSTANT #SYMBOLS
 \ Number of possible symbols

#SYMBOLS CELLS ARRAY SHIFT-TABLE \ Make shift-table
DEFER COMPARATOR
' COMPARE IS COMPARATOR   \ Default COMPARATOR.

\ Initializing the shift table.

: INIT-TABLE ( c-addr1 u -- )
       DUP 1+
       SHIFT-TABLE #SYMBOLS CELLS OVER + SWAP
       DO      DUP I ! 1 CELLS +LOOP
       DROP
       DUP 0 2SWAP OVER + SWAP
       DO
               I C@ CELLS SHIFT-TABLE +
               >R 2DUP - R> !
               1+
       LOOP
       2DROP
;

\ QUICKSEARCH


\ In this version, certain quantities are stored in VALUEs
\ For many applications, local variables would be better
0 VALUE QS-TA     0 VALUE QS-TL
0 VALUE QS-PA     0 VALUE QS-PL

: QUICKSEARCH ( c-addr[p] u[p] c-addr[t] u[t] -- c-addr[m])
\ Enter with addr & len of pattern & text
\ Return address of match or NULL
       TO QS-TL TO QS-TA  \ Store text coordinates
       2DUP INIT-TABLE \ Initialize shift-table
       TO QS-PL TO QS-PA       \ Store pattern coordinates
       NULL     \ NULL on stack
       QS-TA QS-TL QS-PL - 1+  \ Maximum number of comparisons
       OVER + SWAP   \ Substitute BOUNDS if defined
       DO
               I QS-PL QS-PA QS-PL
               COMPARATOR 0= \ Make a comparison
               IF
                       DROP I LEAVE
      \ Leave if found
               THEN
               I QS-PL + C@ \ Extract char just past pattern
               CELLS SHIFT-TABLE + @
      \ Extract shift
       +LOOP
;

\ END OF ALGORITHM



\ BELOW HERE ARE SEVERAL TEST UTILITIES, ETC.
\ =======================================================================

TRUE [IF]
\ test code =============================================

: DUMP-ST ( DUMP SHIFT TABLE)
       #SYMBOLS 0
       DO
          I 16 MOD 0= IF CR  I EMIT SPACE THEN
          I CELLS SHIFT-TABLE + @ . SPACE
          I 128 MOD 0= IF REFILL DROP THEN
       LOOP
;

40 STRING pattern$
40 STRING text$

: Show-result ( a -- ) \ After performing search show result
    CR  \  1 TAB
    DUP NULL =
    IF
        DROP ." NO MATCH!"
    ELSE
        >R              \           \ stash match addr on r.s.
        QS-TA QS-TL     \ ta tl     \ get coordinates of text
        OVER R> OVER -  \ ta tl ta len1   \ get coordx of 1st part
        dup >r
        TYPE [CHAR] ^ EMIT \ ta tl
        R> /STRING      \ a' l'     \ index to match
        OVER QS-PL
        TYPE [CHAR] ^ EMIT \ a' l'  \ type matching part
        QS-PL /STRING TYPE  \       \ type tail
    THEN
    CR
;
: QTEST ( -- a )  \ Ask for strings, compare, & return result
    ." Look for: "
    pattern$ dup 39 accept
    ."       in: "
    text$    dup 40 accept
    QUICKSEARCH
    Show-result
;

QTEST
[THEN]
